Overview

CTML Take-Home Test

This test consists of two challenges, each to be completed using Python and the usual data science stack, and each intended to take no more than an hour.

Business Objective & Challenge

1: Breast Cancer Dataset EDA

The “wdbc” data-set you will be working with can be downloaded from here. Please produce a well-presented Jupyter notebook – including any visualisations you feel are useful – which addresses the following:

a. What are the mean, median and standard deviation of the “perimeter” feature?

b. Is the first feature in this data set (the “radius”) normally distributed? Quantify your answer. If not, what might be a more appropriate distribution?

c. Train a classifier to predict the diagnosis of malignant or benign. Compare the results of two classifiers e.g. SVM, logistic regression, decision tree etc.

ML Development Process

Import Packages

The following code is written in Python 3.x. The imported libraries provide pre-built functionality for the tasks below.

Data Reading

Attribute Information:

  1. Number of instances: 569

  2. Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)

  3. Attribute information

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Real-valued input features

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)

The papers accompanying the original data set contain detailed descriptions of how these features are computed.

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

All feature values are recorded with four significant digits.

  1. Missing attribute values: none

  2. Class distribution: 357 benign, 212 malignant
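As a sketch of the data-reading step, the same WDBC data ships with scikit-learn, which avoids hard-coding a download URL; note that the column names below come from scikit-learn, not from the header-less wdbc.data file, and the target encoding there is 0 = malignant, 1 = benign.

```python
# Minimal sketch: load the WDBC data via scikit-learn instead of downloading
# the raw wdbc.data file (no header row in the original file).
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame                      # 569 rows: 30 features + 'target'
print(df.shape)                      # (569, 31)

counts = df['target'].value_counts()
# scikit-learn encodes target 0 = malignant, 1 = benign
print(counts[1], 'benign,', counts[0], 'malignant')   # 357 benign, 212 malignant
```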

Data Quality

Summarizing Missing Values

Data Exploration and Data Preparation

In this section the dataset will be analyzed to come up with the main characteristics that could help to understand the nature of the data. The analysis will assist to find out which features are more crucial in predicting whether a sample is cancerous or not.

Descriptive Statistics

Descriptive Statistics for Continuous variable

# sigma = SE * sqrt(n)
sq = np.sqrt(len(df))
df['perimeter_std'] = df['perimeter_SE'] * sq

Calculating Standard Deviation

$$
\begin{aligned}
&\text{standard deviation } \sigma = \sqrt{ \frac{ \sum_{i=1}^n \left(x_i - \bar{x}\right)^2 }{n-1} } \\
&\text{variance} = \sigma^2 \\
&\text{standard error } \sigma_{\bar x} = \frac{\sigma}{\sqrt{n}} \\
&\textbf{where:} \\
&\bar{x} = \text{the sample's mean} \\
&n = \text{the sample size}
\end{aligned}
$$

df.iloc[:, 2:12]            # the ten mean features
testx = df.iloc[:, [2, 12]]
testx.columns

Challenge 1 What are the mean, median and standard deviation of the “perimeter” feature?
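A minimal sketch of the required statistics, using scikit-learn's copy of the data; the mean-perimeter column is named 'mean perimeter' there (the original wdbc.data file has no header row, so any column naming is our assumption).

```python
# Summary statistics of the mean-perimeter feature.
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
perimeter = df['mean perimeter']

print(f"mean   = {perimeter.mean():.2f}")
print(f"median = {perimeter.median():.2f}")
print(f"std    = {perimeter.std():.2f}")   # sample std (ddof=1)
```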

Challenge 2 Is the first feature in this data set (the “radius”) normally distributed? Quantify your answer. If not, what might be a more appropriate distribution?

# We fit a probability distribution to radius, then test for normality.
# The Anderson-Darling, Kolmogorov-Smirnov and Shapiro-Wilk tests are widely used options.
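A sketch of the three tests named above, applied to mean radius via scipy.stats (using scikit-learn's copy of the data):

```python
# Normality tests for the mean-radius feature.
from scipy import stats
from sklearn.datasets import load_breast_cancer

radius = load_breast_cancer(as_frame=True).frame['mean radius'].to_numpy()

shapiro_stat, shapiro_p = stats.shapiro(radius)
ks_stat, ks_p = stats.kstest(radius, 'norm',
                             args=(radius.mean(), radius.std(ddof=1)))
ad_result = stats.anderson(radius, dist='norm')

print(f"Shapiro-Wilk p = {shapiro_p:.2e}")        # p < 0.05 -> reject normality
print(f"Kolmogorov-Smirnov p = {ks_p:.2e}")
print(f"Anderson-Darling stat = {ad_result.statistic:.2f} "
      f"(5% critical value = {ad_result.critical_values[2]:.2f})")
```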

The Q-Q plot shows the points deviating from the diagonal reference line, not following the pattern expected of a sample from a Gaussian distribution.

# Hence the variable radius is not normally distributed.
# The best fit is ***genextreme***: the generalized extreme value (GEV) distribution is a family of continuous probability distributions developed within extreme value theory that combines the Gumbel, Fréchet and Weibull families, also known as the type I, II and III extreme value distributions.
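A sketch of fitting the GEV family with scipy, where `scipy.stats.genextreme` is the distribution named above; as a rough goodness-of-fit check we compare Kolmogorov-Smirnov distances against a normal fit (lower = better).

```python
# Fit a GEV distribution to mean radius and compare it to a normal fit.
from scipy import stats
from sklearn.datasets import load_breast_cancer

radius = load_breast_cancer(as_frame=True).frame['mean radius'].to_numpy()

shape, loc, scale = stats.genextreme.fit(radius)
print(f"fitted GEV: shape={shape:.3f}, loc={loc:.3f}, scale={scale:.3f}")

# KS distance of the data to each fitted distribution (lower = better fit).
ks_gev = stats.kstest(radius, 'genextreme', args=(shape, loc, scale)).statistic
ks_norm = stats.kstest(radius, 'norm', args=stats.norm.fit(radius)).statistic
print(f"KS distance: GEV={ks_gev:.3f} vs normal={ks_norm:.3f}")
```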

Categorical Variable (Target Variable)

# In this section, the binary target variable serves as a helpful lens for understanding the data distribution.

We are dealing with three measures per feature: the mean, the standard error and the 'worst'. We should look at each separately.

We can plot the distributions of the mean values and look for skewness, but there is not much to gain from plotting the other two:

The standard error is itself a parameter derived from a distribution and takes only positive values, so it is likely to be right-skewed. The 'worst' value is a biased subsample of each measurement's samples.

The Central Limit Theorem suggests that the distribution of the mean values should look approximately normal. Let's explore that.

We can see that some features are quite skewed. We can measure their skewness using pandas' skew method, and compare it against a log transformation of the same values to see whether the transformation reduces the skewness.

We managed to greatly reduce the skewness of radius, texture, perimeter and concavity. The other features were barely affected by the log transformation.

Mean Features Plot

There are a few things to point out on this plot.

There are a few things to point out on this plot. Of all the histograms on the diagonal of the grid, only fractal dimension shows no visible effect on the tumor class; the same is visible across the plots of the last row/column. This is convenient, because fractal dimension is the unexplained 'oblique' feature we discussed above, and it is a strong candidate for removal during feature selection. The second-weakest visual effect (visual, because we will see some numbers later) is symmetry. All other features have a clear effect on the classification of tumors, and their scatterplots look quite 'separable'. We can also see some very clean relationships among the related geometric features radius, area and perimeter, which is to be expected. This high correlation between features can be a problem for some ML algorithms.

Error Features Plot

We didn't expect to find anything here - and most feature errors indeed have no effect - but look at radius, perimeter, area and compactness. The data suggest that the greater the error in these features, the greater the likelihood of a malignant tumor. How can we explain that?

Let us remember how standard error is calculated: by dividing the standard deviation by the square root of the sample size.

SE = σ / √n

I will assume that the sample size does not change across tumor samples, so the variation in the standard error is due only to the standard deviation.

Assuming that this is the case, we can interpret that there is a high irregularity on the geometry of the malignant tumor, which causes a high standard deviation!

Worst Features Plot

These plots are very similar to the previous ones. This should be expected, because the worst features are sub-samples of the mean data. From the plots alone it is difficult to say which feature is more important for the prediction model; we need numbers to see whether there is a significant difference.

Correlations

To calculate the correlations we can use the pandas corr method.

To visualize it better we could use the classic seaborn heatmap - which is perfectly fine - but I will plot it using horizontal bar charts instead.

The downside of not plotting a heatmap is that we do not see how features are correlated with each other: there might be redundant features we don't need to feed into a machine learning model. We can already see highly correlated features from our previous plots (e.g. perimeter and area), but I've chosen to keep them all and let the algorithms decide for themselves which ones are important and which ones aren't (feature selection and regularization).
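The correlation step can be sketched as below, using scikit-learn's copy of the data; note that in that encoding the target is 1 = benign, so geometric features such as worst perimeter correlate negatively with the target.

```python
# Correlation of every feature with the diagnosis, plotted as a horizontal
# bar chart instead of a heatmap. Agg backend so this also runs headless.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
# target is 1 = benign here, so 'dangerous' features correlate negatively
corr = df.corr()['target'].drop('target').sort_values()

fig, ax = plt.subplots(figsize=(6, 10))
corr.plot.barh(ax=ax)
ax.set_xlabel('Pearson correlation with diagnosis (1 = benign)')
fig.tight_layout()
fig.savefig('correlations.png')
print(corr.head(5))   # the most strongly (negatively) correlated features
```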

Correlation by Feature Type

First, let's see whether we can find a predominant type of feature (worst, mean or SE). Did we read the previous plots correctly?

There are some strong correlations: features like radius_mean, Perimeter_mean, area_mean, (something)_se and (something)_worst are naturally correlated because they are all derived from the same underlying data. Many columns are very highly correlated, which causes multicollinearity, so we should consider removing highly correlated features.

Positive Correlations:

Radius, Perimeter and Area have a strong positive correlation

Radius has a strong positive correlation with Concave Points

Compactness, Concavity and Concave Points have strong positive correlations

Negative Correlations

Fractal Dimension has some negative correlation with Radius, Perimeter and Area

Insights from the plot:

We can observe that all features except fractal dimension follow a similar pattern: the two most strongly correlated feature types are WORST and MEAN, and the weakest is the STANDARD ERROR. Fractal dimension is the exception, since for this feature only the worst measures matter. That said, we must remember that Pearson's correlation only measures pairs of individual features; we cannot see how combinations of features influence the target. As we've mentioned, we will keep them all and let the model decide.

Feature engineering

We create a volume mean feature from radius_mean.
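One way to sketch this feature, treating each nucleus as a sphere (V = 4/3 · π · r³); the column name `volume_mean` is our choice, not necessarily the one used in the notebook.

```python
# Engineer a volume feature from the mean radius, assuming spherical nuclei.
import numpy as np
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
df['volume_mean'] = (4 / 3) * np.pi * df['mean radius'] ** 3

print(df[['mean radius', 'volume_mean']].head(3))
```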

Feature Scaling

Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
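The normalization described above can be sketched with scikit-learn's StandardScaler. In a real pipeline the scaler should be fit on the training split only, to avoid leaking test-set statistics; here it is fit on the full matrix for brevity.

```python
# Standardize every feature to zero mean and unit variance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # per feature: subtract mean, divide by std

print(X_scaled.mean(axis=0).round(2)[:3])      # each feature now has mean ~0
print(X_scaled.std(axis=0).round(2)[:3])       # and standard deviation ~1
```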

Features Distribution

In the swarm plots we split the features into groups, as with the violin plots, so that no single plot becomes too cluttered.
These plots show which features separate the classes most clearly. For some features the malignant and benign points look mostly (though not totally) separated; for others they are mixed, which makes classification using that feature alone difficult. The class-separating variables are: Plot 9.3a: radius_mean, perimeter_mean, area_mean, compactness_mean, concavity_mean, concave_points_mean. Plot 9.3b: radius_SE, perimeter_SE, area_SE. Plot 9.3c: radius_Worst, perimeter_Worst, area_Worst, compactness_Worst, concavity_Worst, concave_points_Worst. Plot 9.3d: radius, perimeter, area, compactness, concavity, concave_points.

Model development

Data Partition: Splitting the Data for Training and Testing
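A minimal sketch of the split, stratified so both splits keep the 357/212 benign/malignant ratio; the 80/20 proportion and random seed are our assumptions.

```python
# Stratified train/test split of the WDBC data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)   # (455, 30) (114, 30)
```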

Imbalanced Learning

Feature Selection using Recursive feature elimination (RFE)

Univariate feature selection works by choosing the best features based on univariate statistical tests; it can be seen as a preprocessing step before fitting an estimator. SelectKBest, for example, removes all but the k highest-scoring features.

Recursive feature elimination (RFE) with a random forest: RFE uses a classifier that assigns a weight to each attribute. The attributes with the smallest absolute weights are cut from the current feature set, and the process is repeated iteratively until the desired number of features remains.

# Recursive feature elimination with cross-validation and random forest classification
We found the optimal number of features to be 24, comprising the best features: 'texture_mean', 'perimeter_mean', 'area_mean', 'compactness_mean', 'concavity_mean', 'concave_points_mean', 'radius_SE', 'area_SE', 'radius_Worst', 'texture_Worst', 'perimeter_Worst', 'area_Worst', 'smoothness_Worst', 'compactness_Worst', 'concavity_Worst', 'concave_points_Worst', 'symmetry_Worst', 'std4', 'dev4', 'dev7', 'radius', 'texture', 'perimeter', 'mesuraments_sum_mean'.
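The RFE step can be sketched as below. For speed we fix `n_features_to_select` by hand; the notebook used RFECV, which chooses that number via cross-validation instead.

```python
# RFE with a random forest: iteratively drop the lowest-importance features.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names

rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=10,   # RFECV would pick this number automatically
)
rfe.fit(X, y)

selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print(len(selected), 'features kept:', selected)
```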

Classification Model

We are going to build classification models and evaluate their performance using the training set.

Baseline Model

We use logistic regression, a support vector machine and a decision tree to train classifiers that predict a diagnosis of malignant or benign.

Case 1: select all features

Case 2: select features based on recursive feature elimination (RFE) with a random forest

# Key finding: comparing the baseline model with RFE-based feature selection shows that feature selection based on RFE improves model performance.
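The baseline comparison can be sketched as below: each classifier in a pipeline with scaling, scored by 5-fold cross-validation (the fold count and hyperparameters are our assumptions, not the notebook's exact settings).

```python
# Compare three baseline classifiers with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'SVC': SVC(),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)   # scale inside the CV loop
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```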

Bayesian Optimization Hyperparameter Tuning

Final Model with Feature Selection and Hyperparameter Tuning

Logistic Regression Model

SVM Model

Decision Tree Classifier

Our model accurately labeled 95% of the test data. This is just a starting point: we could try to increase accuracy further by using algorithms other than logistic regression and support vector machines, or by trying the model with different sets of variables. There are certainly more things that could be done to refine the model, but I end this report here.

This analysis found that the best models for breast cancer diagnosis are the support vector machine and the logistic regression model with the top 17 predictors. We obtained a cross-validation score of ~96% for SVM and logistic regression and a prediction accuracy of ~95% on the test set, which indicates good accuracy without overfitting. We observed that the following variables play an important role in predicting a diagnosis of malignant or benign:

- perimeter_Worst
- concave_points_Worst
- concave_points_mean
- area_Worst
- radius_Worst

Training data: Logistic regression accuracy 94.74%, cross-validation score 96.38% (+/- 4.68%); test data accuracy 95%.
Training data: SVC accuracy 94.74%, cross-validation score 95.69% (+/- 5.23%); test data accuracy 95%.
Training data: Decision tree accuracy 89.47%, cross-validation score 88.10% (+/- 5.81%); test data accuracy 89%.

Challenge 2: Spearman’s Footrule Distance

Suppose we have several different methods for scoring a set of items; perhaps we're asking different people, or using different scoring algorithms. We'd like to figure out how to aggregate these to produce a single combined ranking. A useful tool here is Spearman's Footrule Distance, which computes the distance between two rankings. (Don't worry, we don't expect you to have heard of this before; we expect you to do some Googling.)

Your task here is to implement a function with the following signature:

def sumSpearmanDistances(scores, proposedRanking):
    """Calculate the sum of Spearman's Footrule Distances for a given proposedRanking.

    scores : A dict of {itemId: tuple of scores},
        e.g. {'A': [100, 0.1], 'B': [90, 0.3], 'C': [20, 0.2]} means that item 'A'
        was given a score of 100 by metric 1 and a score of 0.1 by metric 2, etc.
    proposedRanking : An ordered list of itemIds where the first entry is the
        proposed-best and the last entry is the proposed-worst,
        e.g. ['A', 'B', 'C']
    """

Please think about splitting your function into appropriate sub-functions and add tests to demonstrate that everything works as expected. You may assume in your implementation that a higher score = better. You can implement this as a Jupyter notebook or a standalone Python module.
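One possible implementation, split into sub-functions as the brief asks. The footrule distance between two rankings is the sum of the absolute differences in each item's position; we then sum that distance from the proposed ranking to the ranking induced by each scoring metric. Helper names are our own.

```python
def rankFromScores(scores, metricIndex):
    """Order itemIds best-first by the given metric (higher score = better)."""
    return sorted(scores, key=lambda item: scores[item][metricIndex], reverse=True)

def footruleDistance(rankingA, rankingB):
    """Spearman's Footrule: sum of absolute differences in item positions."""
    posA = {item: i for i, item in enumerate(rankingA)}
    posB = {item: i for i, item in enumerate(rankingB)}
    return sum(abs(posA[item] - posB[item]) for item in posA)

def sumSpearmanDistances(scores, proposedRanking):
    """Sum the footrule distance from proposedRanking to each metric's ranking."""
    nMetrics = len(next(iter(scores.values())))
    return sum(
        footruleDistance(rankFromScores(scores, m), proposedRanking)
        for m in range(nMetrics)
    )

# Example from the brief: metric 1 ranks A,B,C; metric 2 ranks B,C,A.
scores = {'A': [100, 0.1], 'B': [90, 0.3], 'C': [20, 0.2]}
print(sumSpearmanDistances(scores, ['A', 'B', 'C']))  # 0 + (2+1+1) = 4
```

A proposed ranking that agrees perfectly with every metric scores 0, so the best aggregate ranking is the one minimizing this sum.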
# Final Result: this output shows the final ranking based on Spearman's Footrule Distance.